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Abstract 

The mplot package provides an easy to use implementation of model stability and 
variable inclusion plots (Muller and Welsh 2010; Murray, Heritier, and Muller 2013) as well 
as the adaptive fence (Jiang, Rao, Gu, and Nguyen 2008; Jiang, Nguyen, and Rao 2009) for 
linear and generalised linear models. We provide a number of innovations on the standard 
procedures and address many practical implementation issues including the addition of 
redundant variables, interactive visualisations and approximating logistic models with 
linear models. An option is provided that combines our bootstrap approach with glmnet 
for higher dimensional models. The plots and graphical user interface leverage state of the 
art web technologies to facilitate interaction with the results. The speed of implementation 
comes from the leaps package and cross-platform multicore support. 

Keywords', model selection, variable selection, linear models, mixed models, generalised linear 
models, fence, R. 


1. Graphical tools for model selection 

In this article we introduce the mplot package in R, which provides a suite of interactive 
visualisations and model summary statistics for researchers to use to better inform the variable 
selection process (Tarr, Muller, and Welsh 2016; R Core Team 2015). The methods we provide 
rely heavily on various bootstrap techniques to give an indication of the stability of selecting a 
given model or variable and even though not done here, could be implemented with resampling 
methods other than the bootstrap, for example cross validation. The ‘m’ in mplot stands for 
model selection/building and we anticipate that in future more graphs and methods will be 
added to the package to further aid better and more stable building of regression models. The 
intention is to encourage researchers to engage more closely with the model selection process, 
allowing them to pair their experience and domain specific knowledge with comprehensive 
summaries of the relative importance of various statistical models. 

Two major challenges in model building are the vast number of models to choose from and the 
myriad of ways to do so. Standard approaches include stepwise variable selection techniques 
and more recently the lasso. A common issue with these and other methods is their instability, 
that is, the tendency for small changes in the data to lead to the selection of different models. 

An early and significant contribution to the use of bootstrap model selection is Shao (1996) 
who showed that carefully selecting m in an m-out-of-n bootstrap drives the theoretical prop¬ 
erties of the model selector. Muller and Welsh (2005, 2009) modified and generalised Shao’s 
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m-out-of-n bootstrap model selection method to robust settings, first in linear regression and 
then in generalised linear models. The bootstrap is also used in regression models that are 
not yet covered by the mplot package, such as mixed models (e.g., Shang and Cavanaugh 
2008) or partially linear models (e.g., Muller and Vial 2009) as well as for the selection of 
tuning parameters in regularisation methods (e.g.. Park, Sakaori, and Konishi 2014). 

Assume that we have n independent observations y = (yi,... ,y„)"'~ and an n x p full rank 
design matrix X whose columns are indexed by 1,..., p. Let a denote any subset of pa distinct 
elements from {1,... ,p}. Let be the corresponding n x pa design matrix and xA denote 
the ith row of Xq,. 

The mplot package focuses specifically on linear and generalised linear models (GLM). In 
the context of GLMs, a model a for the relationship between the response y and the design 
matrix Xq is specified by 

E(y) = /i(Xq/3q), and var(y) = o-^u(/i(Xq/3q)), (1) 

where (3a is an unknown p^-vector of regression parameters and a is an unknown scale pa¬ 
rameter. Here E(-) and var(-) denote the expected value and variance of a random variable, 
h is the inverse of the usual link function and both h and v are assumed known. When h is 
the identity and v{-) = 1, we recover the standard linear model. 

The purpose of model selection is to choose one or more models a from a set of candidate 
models, which may be the set of all models A or a reduced model set (obtained, for example, 
using any initial screening method). Many model selection procedures assess model fit using 
the generalised information criterion, 

GIC(a,A) = g(a) + ApQ. (2) 

The Q{ot) component is a measure of “description loss” or “lack of fit”, a function that describes 
how well a model fits the data, for example, the residual sum of squares or —2 x log-likelihood. 
The number of independent regression model parameters, p^, is a measure of “model com¬ 
plexity”. The penalty multiplier. A, determines the properties of the model selection criterion 
(Muller, Scealy, and Welsh 2013; Muller and Welsh 2010). Special cases, when Q{a) = 
—2 X log-likelihood(Q;), include the AIC with A = 2, BIG with A = log(n) and more generally 
the generalised information criterion (GIG) with A G M (Konishi and Kitagawa 1996). 

The mplot package currently implements “variable inclusion plots”, “model stability plots” and 
a model selection procedure inspired by the adaptive fence of Jiang et al. (2008). Variable 
inclusion plots were introduced independently by Muller and Welsh (2010) and Meinshausen 
and Biihlmann (2010). The idea is that the best model is selected over a range of values of 
the penalty multiplier A and the results are visualised on a plot which shows how often each 
variable is included in the best model. These types of plots have previously been referred to 
as stability paths, model selection curves and most recently variable inclusion plots (VIPs) in 
Murray et al. (2013). An alternative to penalising for the number of variables in a model is 
to assess the fit of models within each model size. This is the approach taken in our model 
stability plots where searches are performed over a number of bootstrap replications and the 
best models for each size are tallied. The rationale is that if there exists a “correct” model 
of a particular model size it will be selected overwhelmingly more often than other models 
of the same size. Finally, the adaptive fence was introduced by Jiang et al. (2008) to select 
mixed models. This is the first time code has been made available to implement the adaptive 
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fence and the first time the adaptive fence has been applied to linear and generalised linear 
models. 

This article introduces three data examples that each highlight different aspects of the graph¬ 
ical methods made available by mplot. Sections 2-5 are based on a motivating example where 
the true data generating model is known. We use this example to highlight one of the classical 
failings of stepwise procedures before introducing variable inclusion plots and model stability 
plots through the vis () function in Section 3. Our implementation of the adaptive fence with 
the af 0 function is presented in Section 4. 

For all methods, we provide publication quality classical plot methods using base R graphics 
as well as interactive plots using the googleVis package (Gesmann and de Castillo 2011). In 
Section 5, we show how to add further utility to these plot methods by packaging the results 
in a shiny web interface which facilitates a high degree of interactivity (Chang, Cheng, Allaire, 
Xie, and McPherson 2015). 

In Section 6 we show computing times in a simulation study, varying the number of variables 
from 5 to 50; we further illustrate the advantage of using multiple core technology. We then 
show with two applied examples the practical merit of our graphical tools in Section 7. 

To conclude, we highlight in Section 8 the key contributions of the three data examples and 
make some final brief remarks. 


2. Illustrative example 

We will present three examples to help illustrate the methods provided by the mplot package. 
Two real data sets are presented as case studies in Section 7. The first of these is a subset of 
the diabetes data set used in Efron, Hastie, Johnstone, and Tibshirani (2004) which has 10 
explanatory variables and a continuous dependent variable, a measure of disease progression, 
suitable for use in a linear regression model. The second is a binomial regression example 
from Hosmer and Lemeshow (1989) concerning low birth weight. 

The artificially generated data set was originally designed to emphasise statistical deficiencies 
in stepwise procedures, but here it will be used to highlight the utility of the various procedures 
and plots provided by mplot. The data set and details of how it was generated are provided 
with the mplot package. 

R> install.packages("mplot") 

R> dataC'artificialeg", package = "mplot") 

R> helpC'artificialeg", package = "mplot") 

A scatterplot matrix of the data and the estimated pairwise correlations are given in Figure 1. 
There are no outliers and we have not positioned the observations in a subspace of the artifi¬ 
cially generated data set. All variables, while related, originate from a Gaussian distribution. 
Fitting the full model yields no individually significant variables. 

R> require("mplot") 

R> dataC'artificialeg") 

R> full.model = lm(y ~ ., data = artificialeg) 

R> round(summary(full.model)$coef, 2) 
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Figure 1: Scatterplot matrix of the artificially generated data set with estimated correlations 
in the upper right triangle. The true data generating process for the dependent variable is 
y = 0.6x8 + e where e ~ AA(0, 2^). 
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Performing default stepwise variable selection yields a model with all explanatory variables 
except X8. As an aside, the dramatic changes in the p-values indicate that there is sub¬ 
stantial interdependence between the explanatory variables even though none of the pairwise 
correlations in Figure 1 are particularly extreme. 


R> step.model = step(full.model, trace = 0) 
R> round(summary(step.model)$coef, 2) 
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The true data generating process is, y = 0.6 xg + e, where s r-u AA(0,2^). The bivariate 
regression of y on xg is the more desirable model, not just because it is the true model 
representing the data generating process, but it is also more parsimonious with essentially the 
same residual variance as the larger model chosen by the stepwise procedure. This example 
illustrates a key statistical failing of stepwise model selection procedures, in that they only 
explore a subset of the model space so are inherently susceptible to local minima in the 
information criterion (Harrell 2001). 

Perhaps the real problem with of stepwise methods is that they allow researchers to transfer all 
responsibility for model selection to a computer and not put any real thought into the model 
selection process. This is an issue that is also shared, to a certain extent with more recent 
model selection procedures based on regularisation such as the lasso and least angle regression 
(Tibshirani 1996; Tibshirani, Johnstone, Hastie, and Efron 2004), where attention focusses 
only on those models that are identified by the path taken through the model space. In 
the lasso, as the tuning parameter A is varied from zero to oo, different regression parameters 
remain non-zero, thus generating a path through the set of possible regression models, starting 
with the largest “optimal” model when A = 0 to the smallest possible model when A = oo, 
typically the null model because the intercept is not penalised. The lasso selects that model 
on the lasso path at a single A value, that minimises one of the many possible criteria (such 
as 5-fold cross-validation, or the prediction error) or by determining the model on the lasso 
path that minimises an information criterion (for example BIC). 

An alternative to stepwise or regularisation procedures is to perform exhaustive searches 
of the model space. While exhaustive searches avoid the issue of local minima, they are 
computationally expensive, growing exponentially in the number of variables p, with more 
than a thousand models when p = 10 and a million when p = 20. The methods provided in the 
mplot package and described in the remainder of the article go beyond stepwise procedures by 
incorporating exhaustive searches where feasible and using resampling techniques to provide 
an indication of the stability of the selected model. The mplot package can feasibly handle up 
to 50 variables in linear regression models and a similar number for logistic regression models 
when an appropriate transformation (described in Section 7.2) is implemented. 


3. Model stability and variable inclusion plots 

The main contributions of the mplot package are model stability plots and variable inclusion 
plots, implemented through the vis() function, and the simplihed adaptive fence for linear 
and generalised linear models via the af () function which is discussed in Section 4. 
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Figure 2: Results of calls to plot (vis. art, interactive = FALSE) with additional argu¬ 
ments which = "Ivk" in the top left, which = "boot" in the top right and which = "vip" 
down the bottom. 

Our methods generate large amounts of raw data about the fitted models. While the print 
and summary output from both functions provide suggestions as to which models appear to 
be performing best, it is not our intention to have researchers simply read off the “best” model 
from the output. The primary purpose of these techniques is to help inform a researcher’s 
model selection choice. As such, the real value in using these functions is in the extensive 
plot methods provided that help visualise the results and get new insights. This is reflected 
in the choice of name vis, short for visualise, as this is the ultimate goal - to visualise the 
stability of the model selection process. 

3.1. Model stability plots 

In order to generate model stability and variable inclusion plots, the first step is to generate a 
vis object using the vis() function. To generate a vis object for the artificial data example 
the fitted full model object along with some optional arguments are passed to the vis() 
function. 

R> Im.art = lm(y ~ data = artificialeg) 

R> vis.art = vis(Im.art, B = 150, redundant = TRUE, nhest = "all") 

The B = 150 argument provided to the vis() function tells us that we want to perform 150 
bootstrap replications. See Murray et al. (2013) for more detail on the use of exponential 


















Garth Tarr, Samuel Mueller, Alan H. Welsh 


7 


weights in bootstrap model selection. Specifying redundant = TRUE is unnecessary, as it is 
the default option; it ensures that an extra variable, randomly generated from a standard 
normal distribution and hence completely unrelated to the true data generating process, is 
added to the full model. This extra redundant variable can be used as a baseline comparison in 
the variable inclusion plots. Finally, the nbest argument controls how many models with the 
smallest Q{a) for each model size k = 1,... ,p are recorded. It can take an integer argument 
or specifying nbest = "all" ensures that all possible models are displayed when the plot 
methods is called, as shown in the top left panel of Figure 2. Typically researchers do not 
need to visualise the entire model space and in problems with larger numbers of candidate 
variables it is impractical to store and plot results for all models. The default behaviour of the 
vis() function is to set nbest = 5, essentially highlighting the maximum enveloping lower 
convex curve of Murray et al. (2013). 

The simplest visualisation of the model space is to plot a measure of description loss against 
model complexity for all possible models, a special implementation is the Mallows Cp plot 
(Mallows 2000). This is done using the argument which = "Ivk" to the plot function applied 
to a vis object. The string "Ivk" is short for loss versus k, the dimension of the model. 

R> plot(vis.art, interactive = FALSE, highlight = "x8", which = "Ivk") 

The result of this function can be found in the top left panel of Figure 2. The highlight 
argument is used to differentiate models that contain a particular variable from those that 
do not. This is an implementation of the “enriched scatter plot” of Murray et al. (2013). 
There is a clear separation between models that contain xg and those that do not, that is, 
all triangles are clustered towards the bottom with the circles above in a separate cluster. 
There is no similar separation for the other explanatory variables (not shown). These results 
strongly suggest that xg is the single most important variable. For clarity the points have been 
jittered slightly along the horizontal axis, though the model sizes remain clearly differentiated. 

Rather than performing a single pass over the model space and plotting the description loss 
against model size, a more nuanced and discerning approach is to use a (exponential weighted) 
bootstrap to determine how often various models achieve the minimal loss for each model size. 
The advantage of the bootstrap approach is that it gives a measure of model stability for each 
model size as promoted by Meinshausen and Biihlmann (2010), Muller and Welsh (2010) and 
Murray et al. (2013). 

The weighted bootstrap has two key-benefits over the residual or nonparametric bootstrap: 
First, the weighted bootstrap always yields observable responses which is particularly relevant 
when these observable values are restricted to be integers (as in many generalized linear 
models), or, when y values are naturally bounded, say to be observed on the interval 0 to 1; 
Second, the weighted bootstrap does not suffer from separation issues that regularly occur in 
logistic and other models. The pairs bootstrap also yields observable responses and can be 
thought of as a special (boundary) case of the weighted bootstrap where some weights are 
allowed to be exactly zero, which can create a separation issue in logistic models. Therefore, 
we have chosen to implement the weighted bootstrap because it is a simple, elegant method 
that appears to work well. Specifically, we utilise the exponential weighted bootstrap where 
the observations are reweighted with weights drawn from an exponential distribution with 
mean 1 (see Murray et al. (2013) for more detail). 

To visualise the results of the exponential weighted bootstrap, the which = "boot" argument 
needs to be passed to the plot call on a vis object. The highlight argument can again be used 
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to distinguish between models with and without a particular variable. Each circle represents a 
model with a non-zero bootstrap probability, that is, each model that was selected as the best 
model of a particular dimension in at least one bootstrap replication. Furthermore, the area 
of each circle is proportional to the corresponding model’s bootstrapped selection probability. 

Figure 2 is an example of a model stability plot for the artificial data set. The null model, the 
full model and the simple linear regression of y on xg all have bootstrap probabilities equal to 
one. While there are alternatives to the null and full model their inclusion in the plot serves 
two main purposes. Firstly, to gauge the potential range in description loss and secondly to 
provide a baseline against which to compare other circles to see if any approach a similar size, 
which would indicate that those are dominant models of a given model dimension. In Figure 2, 
for model dimensions of between three and ten, there are no clearly dominant models, that 
is, within each model size there are no models that are selected much more commonly than 
the alternatives. 

A print method is available for vis objects which prints the model formula, log-likelihood and 
proportion of times that a given model was selected as the “best” model within each model 
size. The default minimum probability of a model being selected before it gets printed is 0.3, 
though this can be customised by passing a min.prob argument to the print function. 

fi> print(vis.art, min.prob = 0.25) 

name prob logLikelihood 


y~l 1.00 -135.33 

y~x8 1.00 -105.72 

y~x4+x8 0.40 -103.63 

y~xl+x8 0.27 -104.47 

y~xl+x2+x3+x4+x5+x6+x7+x9 0.26 -100.63 

y~xl+x2+x3+x4+x5+x6+x7+x9+RV 0.33 -100.51 


The output above, reinforces what we know from the top right panel of Figure 2. The null 
model is always selected and in models of size two a regression of y on xg is always selected. In 
models of size three the two most commonly selected models are y~x4+x8, which was selected 
40% of the time and y~xl+x8 selected in 27% of bootstrap replications. Interestingly, in 
models of size nine and ten, the most commonly selected models do not contain xg, these are 
shown as blue circles in the plot. We will see in the next section that this phenomenon is 
related to the failure of stepwise variable selection with this data set. 

3.2. Variable inclusion plots 

Rather than visualising a loss measure against model size, it can be instructive to consider 
which variables are present in the overall “best” model over a set of bootstrap replications. To 
facilitate comparison between models of different sizes we use the GIC, equation (2), which 
includes a penalty term for the number of variables in each model. 

Using the same exponential weighted bootstrap replications as in the model selection plots, 
we have a set of B bootstrap replications and for each model size we know which model has 
the smallest description loss. This information is used to determine which model minimises 
the GIC over a range of values of the penalty parameter. A, in each bootstrap sample. For 


Garth Tarr, Samuel Mueller, Alan H. Welsh 


9 


each value of A, we extract the variables present in the “best” models over the B bootstrap 
replications and calculate the corresponding bootstrap probabilities that a given variable is 
present. These calculations are visualised in a variable inclusion plot (VIP) as introduced 
by Muller and Welsh (2010) and Murray et al. (2013). The VIP shows empirical inclusion 
probabilities as a function of the penalty multiplier A. The probabilities are calculated by 
observing how often each variable is retained in B exponential weighted bootstrap replications. 
Specifically, for each bootstrap sample b = 1,..., B and each penalty multiplier A, the chosen 
model, a\ G A, is that which achieves the smallest GIC(a, A; w;,) = Q^{a) + \pa, where wj, 
is the n-vector of independent and identically distributed exponential weights (we refer to 
Section 2.5 in Murray et al. (2013) for more information on the weighted bootstrap). The 
inclusion probability for variable Xj is estimated by B~^ ^ where I{j G a\} is 

one if Xj is in the final model and zero otherwise. Following Murray et al. (2013), the default 
range of A values is A G [0, 2 log(n)] as this includes most standard values used for the penalty 
parameter. 

The example shown in the bottom panel of Figure 2 is obtained using the which = "vip" 
argument to the plot function. As expected, when the penalty parameter is equal to zero, all 
variables are included in the model; the full model achieves the lowest description loss, and 
hence minimises the GIG when there is no penalisation. As the penalty parameter increases, 
the inclusion probabilities for individual variables typically decrease as more parsimonious 
models are preferred. In the present example, the inclusion probabilities for the xg variable 
exhibit a sharp decrease at low levels of the penalty parameter, but then increase steadily 
as a more parsimonious model is sought. This pattern helps to explain why stepwise model 
selection chose the larger model with all the variables except xg - there exists a local minimum. 
Hence, for large models the inclusion of xg adds no additional value over having all the other 
explanatory variables in the model. 

It is often instructive to visualise how the inclusion probabilities change over the range of 
penalty parameters. The ordering of the variables in the legend corresponds to their average 
inclusion probability over the whole range of penalty values. We have also added an indepen¬ 
dent standard Gaussian random variable to the model matrix as a redundant variable (RV). 
This provides a baseline to help determine which inclusion probabilities are “significant” in 
the sense that they exhibit a different behaviour to the RV curve. Variables with inclusion 
probabilities near or below the RV curve can be considered to have been included by chance. 
To summarise, VIPs continue the model stability theme. Rather than simply using a single 
penalty parameter associated with a particular information criterion, for example the AIG 
with A = 2, our implementation of VIPs adds considerable value by allowing us to learn from 
a range of penalty parameters. Furthermore, we are able to see which variables are most often 
included over a number of bootstrap samples. An alternative approach to assessing model 
stability, the simplified adaptive fence, is introduced in the next section. 


4. The simplified adaptive fence 


The fence, first introduced by Jiang et al. (2008), is built around the inequality 

Q{a) - Q{af) < c, 


where Q is an empirical measure of description loss, a is a candidate model and aj is the 
baseline, “full” model. The procedure attempts to isolate a set of “correct models” that satisfy 
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the inequality. A model a*, is described as “within the fence” if Q{c(*) — Q{af) < c. From 
the set of models within the fence, the one with minimum dimension is considered optimal. If 
there are multiple models within the fence at the minimum dimension, then the model with 
the smallest Q{a) is selected. For a recent review of the fence and related methods, see Jiang 
(2014). 

The implementation we provide in the mplot package is inspired by the simplified adaptive 
fence proposed by Jiang et al. (2009), which represents a significant advance over the original 
fence method proposed by Jiang et al. (2008). The key difference is that the parameter c is 
not fixed at a certain value, but is instead adaptively chosen. Simulation results have shown 
that the adaptive method improves the finite sample performance of the fence, see Jiang et al. 
(2008, 2009). 

The adaptive fence procedure entails bootstrapping over a range of values of the parameter 
c. For each value of c a parametric bootstrap is performed under af. For each bootstrap 
sample we identify the smallest model inside the fence, d{c). Jiang et al. (2009) suggest that 
if there is more than one model, choose the one with the smallest Q{a). Define the empirical 
probability of selecting model a for a given value of c as p*{c,a) = P*{d{c) = a}. Hence, 
if B bootstrap replications are performed, p*{c,a) is the proportion of times that model a 
is selected. Finally, define an overall selection probability, p*(c) = maxQ,g_ 4 p*(c, a) and plot 
p*{c) against c to find the first peak. The value of c at the first peak, c*, is then used with 
the standard fence procedure on the original data. 

Our implementation is provided through the af () function and associated plot methods. An 
example with the artificial data set is given in Figure 3 which is generated using the following 
code. 

fi> af.art = af(Im.art, B = 150, n.c = 50) 

R> plot(af.art, interactive = FALSE, best.only = TRUE) 

The arguments indicate that we perform B = 150 bootstrap resamples, over a grid of 50 
values of the parameter c. In this example, there is only one peak, and the choice of c* =21.1 
is clear. 

One might expect that there should be a peak corresponding to the full model at c = 0, but 
this is avoided by the inclusion of at least one redundant variable. Any model that includes 
the redundant variable is known to not be a “true” model and hence is not included in the 
calculation of p*{c). This issue was first identified and addressed by Jiang et al. (2009). 

There are a number of key differences between our implementation and the method proposed 
by Jiang et al. (2009). Perhaps the most fundamental difference is in the philosophy underlying 
our implementation. Our approach is more closely aligned with the concept of model stability 
than with trying to pick a single “best” model. This can be seen through the plot methods 
we provide. Instead of simply using the plots to identify the first peak, we add a legend that 
highlights which models were the most frequently selected for each parameter value, that is, 
for each c value we identify which model gave rise to the p* (c) value. In this way, researchers 
can ascertain if there are regions of stability for various models. In the example given in 
Figure 3, there is no need to even define a c* value, it is obvious from the plot that there is 
only one viable candidate model, a regression of y on xs. 

Our approach considers not just the best model of a given model size, but also allows users to 
view a plot that takes into account the possibility that more than one model of a given model 


Garth Tarr, Samuel Mueller, Alan H. Welsh 


11 


• 1 • x1 + x4 + x8 • x8 

• x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 x3 + x4 + x8 

• x1 + x2 + x3 + x4 + x5 + x6 + x7 + x9 x4 + x8 


• 1 x3 + x4 + x8 

• x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 • x8 

• x1 + x2 + x3 + x4 + x5 + x6 + x7 + x9 


1.00 

0.75 
VO.50 
0.25 

0.00 

0 10 20 30 40 0 10 20 30 40 

c c 



1.00- 

•• ••••.. 

••• 


••• 

• • 

• 

0.75- 

•• 

• 




• • 


• • • 

• ••• 

q.0.50- 

• •• 

• 


• • 

• 


• •• 

• 

0.25- 

• 

• 


• 

• 


• 

• .••• 


*••••* 


o.oo- 



Figure 3: Result of a call to plot (af .art, interactive = FALSE) with additional argu¬ 
ments best. only = TRUE on the left and best. only = FALSE on the right. The more rapid 
decay after the xg model is typical of using best. only = FALSE where the troughs between 
candidate/dominant models are more pronounced. 


size is within the fence. The best. only = FALSE option when plotting the results of the 
adaptive fence is a modification of the adaptive fence procedure which considers all models 
of a particular size that are within the fence when calculating the p*{c) values. In particular, 
for each value of c and for each bootstrap replication, if a candidate model is found inside 
the fence, then we look to see if there are any other models of the same size that are also 
within the fence. If no other models of the same size are inside the fence, then that model 
is allocated a weight of 1. If there are two models inside the fence, then the best model is 
allocated a weight of 1/2. If three models are inside the fence, the best model gets a weight 
of 1/3, and so on. After B bootstrap replications, we aggregate the weights by summing 
over the various models. The p* (c) value is the maximum aggregated weight divided by the 
number of bootstrap replications. This correction penalises the probability associated with 
the best model if there were other models of the same size inside the fence. The rationale is 
that if a model has no redundant variables then it will be the only model of that size inside 
the fence over a range of values of c. The result is more pronounced peaks which can help to 
determine the location of the correct peak and identify the optimal c* value or more clearly 
differentiate regions of model stability. This can be seen in the right hand panel of Figure 3. 

Another key difference is that our implementation is designed for linear and generalised linear 
models, rather than mixed models. As far as we are aware, this is the first time fence methods 
have been applied to such models. There is potential to add mixed model capabilities to future 
versions of the mplot package, but computational speed is a major hurdle that needs to be 
overcome. The current implementation is made computationally feasible through the use of 
the leaps and bestglm packages and the use of parallel processing, as discussed in Section 6 
(Lumley and Miller 2009; McLeod and Xu 2014). 

We have also provided an optional initial stepwise screening method that can help limit the 
range of c values over which to perform the adaptive fence procedure. The initial stepwise 
procedure performs forward and backward stepwise model selection using both the AIC and 
BIC. From the four candidate models, we extract the size of smallest and largest models, ki 
and kjj respectively. To obtain a sensible range of c values we consider the set of models 
with dimension between kn — 2 and kjj + 2. Due to the inherent limitations of stepwise 
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procedures, outlined in Section 2, it can be useful to check initial. stepwise = FALSE with 
a small number of bootstrap replications over a sparse grid of c values to ensure that the 
initial. stepwise = TRUE has produced a reasonable region. 


5. Interactive graphics 

To facilitate that researchers can more easily gain value from the static plots given in Figures 2 
and 3 and to help them interact with the model selection problem more closely, we have 
provided a set of interactive graphics based on the googleVis package and wrapped them in 
a shiny user interface. It is still quite novel for a package to provide a shiny interface for its 
methods, but there is precedent, see, for example McMurdie and Holmes (2013) or Gabry 
(2015). 

Among the most important contributions of these interactive methods is: the provision of 
tooltips to identify the models and/or variables; pagination of the legend for the variable 
inclusion plots; and a way to quickly select which variable to highlight in the model stability 
plots. These interactive plots can be generated when the plotO function is run on an af or 
vis object by specifying interactive=TRUE. 

The mplot package takes interactivity a step further, embedding these plots within a shiny 
web interface. This is done through a call to the mplot () function, which requires the full 
fitted model as the first argument and then a vis object and/or af object (in any order). 

fi> mplot(Im.art, vis.art, af.art) 

Note that the vis() and af () functions need to be run and the results stored prior to calling 
the mplot 0 function. The result of a call to this function is a webpage built using the shiny 
package with shinydashboard stylings (Chang et al. 2015; Chang 2015). Figure 4 shows a 
series of screen shots for the artificial example, equivalent to Figures 2 and 3, resulting from 
the above call to mplot (). 

The top panel of Figure 4 shows a model stability plot where the full model that does not 
contain xg has been selected and a tooltip has been displayed. It gives details about the 
model specification, the log-likelihood and the bootstrap selection probability within models 
of size 10. The tooltip makes it easier for users to identify which variables are included in 
dominant models than the static plot equivalent. On the left hand side of the shiny interface, 
a drop down menu allows users to select the variable to be highlighted. This is passed through 
the highlight argument discussed in Section 3.1. Models with the highlighted variable are 
displayed as red circles whereas models without the highlighted variable are displayed as blue 
circles. The ability for researchers to quickly and easily see which models in the stability plot 
contain certain variables enhances their understanding of the relative importance of different 
components in the model. Selecting “No” at the “Bootstrap?” radio buttons yields the plot 
of description loss against dimension shown in the top left panel of Figure 2. 

The middle panel of Figure 4 is a screen shot of an interactive variable inclusion plot. When 
the mouse hovers over a line, the tooltip gives information about the bootstrap inclusion 
probability and which variable the line represents. Note that in comparison to the bottom 
panel of Figure 2, the legend is now positioned outside of the main plot area. When the user 
clicks a variable in the legend, the corresponding line in the plot is highlighted. This can be 
seen in Figure 4, where the x% variable in the legend has been clicked and the corresponding 
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xs line in the variable inclusion plot has been highlighted. The highlighting is particularly 
useful with the redundant variable, so it can easily be identified. If the number of predictor 
variables is such that they no longer fit neatly down the right hand side of the plot, they 
simply paginate, that is an arrow appears allowing users to toggle through to the next page 
of variables. This makes the interface cleaner and easier to interpret than the static plots. 
Note also the vertical lines corresponding to traditional AIC and BIC penalty values. 

The bottom panel of Figure 4 is an interactive adaptive fence plot. The tooltip for a par¬ 
ticular point gives information about the explanatory variable(s) in the model, the a* = 
argmaxQ,g_ 4 p*(c, a) value and the {c,p*{c)) pair that has been plotted. Hovering or clicking 
on a model in the legend highlights all the points in the plot corresponding to that model. 
In this example, the xg legend has been clicked on and an additional circle has been added 
around all points representing the regression with xg as the sole explanatory variable. The 
shiny interface on the left allows users to toggle between best. only = TRUE and best. only 
= FALSE. 

The interactive graphics and shiny interface are most useful in the exploratory stage of model 
selection. Once the researcher has found the most informative plot through interactive anal¬ 
ysis, the more traditional static plots may be used in a formal write up of the problem. 


6. Timing 

Any bootstrap model selection procedure is time consuming. However, for linear models, we 
have leveraged the efficiency of the branch-and-bound algorithm provided by leaps (Miller 
2002; Lumley and Miller 2009). The bestglm package is used for GLMs; but in the absence 
of a comparably efficient algorithm the computational burden is much greater (McLeod and 
Xu 2014). 

Furthermore, we have taken advantage of the embarrassingly parallel nature of bootstrapping, 
utilising the doParallel and foreach packages to provide cross platform multicore support, 
available through the cores argument (Revolution Analytics and Weston 2014a,b). By default 
it will detect the number of cores available on your computer and leave one free. 

Figure 5 shows the timing results of simulations run for standard use scenarios with 4, 8 or 
16 cores used in parallel. Each observation plotted is the average of four runs of a given 
model size. The simulated models had a sample size of n = 100 with 5,10,..., 50 candidate 
variables, of which 30% were active in the true model. 

The results show both the vis() and af () functions are quite feasible on standard desktop 
hardware with 4 cores even for moderate dimensions of up to 40 candidate variables. The 
adaptive fence takes longer than the vis() function, though this is to be expected as the 
effective number of bootstrap replications is Bxn. c, where n. c is the number divisions in the 
grid of the parameter c. 

The results for GLMs are far less impressive, even when the maximum dimension of a can¬ 
didate solution is set to nvmax = 10. In its current implementation, the adaptive fence is 
only really feasible for models of around 10 predictors and the vis() function for 15. Fu¬ 
ture improvements could see approximations of the type outlined by Hosmer, Jovanovic, and 
Lemeshow (1989) to bring the power of the linear model branch-and-bound algorithm to 
GLMs. An example of how this works in practice is given in Section 7.2. 

An alternative approach for high dimensional models would be to consider subset selection 
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af(glm, B=50, n.c=25) 



vis(glm, B=50) 



Figure 5: Average time required to run the af() and vis() functions when n = 100. A 
binomial regression was used for the GLM example. 


with convex relaxations as in Shen, Pan, and Zhu (2012) or combine bootstrap model selection 
with regularisation. In particular, we have implemented variable inclusion plots and model 
stability plots for glmnet (Friedman, Hastie, and Tibshirani 2010). In general, this is very fast 
for models of moderate dimension, but it does not consider the full model space. Restrictions 
within the glmnet package, mean it is only applicable to linear models, binomial logistic 
regression, and Poisson regression with the log link function. The glnrnet package also allows 
for "multinomial", "cox", and "mgaussian" families, though we have not yet incorporated 
these into the mplot package. 


7. Real examples 


7.1. Diabetes example 

Table 1 shows a subset of the diabetes data used in Efron et al. (2004). There are 10 explana¬ 
tory variables, including age (age), sex (sex), body mass index (bmi) and mean arterial blood 
pressure (map) of 442 patients as well as six blood serum measurements (tc, Idl, hdl, tch, 
Itg and glu). The response is a measure of disease progression one year after the baseline 
measurements. 

Figure 6 shows the results of the main methods for the diabetes data obtained using the 
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Patient 

age 

sex 

bmi 

map 

tc 

Serum measurements 

Idl hdl tch Itg 

glu 

Response 

y 

1 

59 

2 

32.1 

101 

157 

93.2 

38 

4 

4.9 

87 

151 

2 

48 

1 

21.6 

87 

183 

103.2 

70 

3 

3.9 

69 

75 

3 

72 

2 

30.5 

93 

156 

93.6 

41 

4 

4.7 

85 

141 

441 

36 

1 

30.0 

95 

201 

125.2 

42 

5 

5.1 

85 

220 

442 

36 

1 

19.6 

71 

250 

133.2 

97 

3 

4.6 

92 

57 


Table 1: Measurements on 442 diabetes patients over 10 potential predictor variables and the 
response variable, a measure of disease progression after one year. 


following code. 


R> Im.d = lin(y ~ data = diabetes) 

R> vis.d = visdm.d, B = 200) 

R> af.d = afdm.d, B = 200, n.c = 100, c.max 
R> plot(vis.d, interactive = FALSE, which = 
R> plot(vis.d, interactive = FALSE, which = 

+ highlight = "hdl") 

R> plot(af.d, interactive = FALSE, best.only 
+ legend.position = "bottomright") 

R> plot(af.d, interactive = FALSE, best.only 


= 100 ) 

"vip") 

"boot", max.circle 
= TRUE, 

= FALSE) 


0.25, 


A striking feature of the variable inclusion plot is the non-monotonic nature of the hdl line. 
As the penalty value increases, and a more parsimonious model is sought, the hdl variable is 
selected more frequently while at the same time other variables with similar information are 
dropped. Such paths occur when a group of variables contains similar information to another 
variable. The hdl line is a less extreme example of what occurs with xg in the artificial 
example (see Figure 2). The path for the age variable lies below the path for the redundant 
variable, indicating that it does not provide any useful information. The bmi and Itg paths 
are horizontal with a bootstrap probability of 1 for all penalty values indicating that they are 
very important variables, as are map and sex. From the variable inclusion plot alone, it is not 
obvious whether tc or hdl is the next most important variable. Some guidance on this issue 
is provided by the model stability and adaptive fence plots. 

In order to determine which circles correspond to which models in the static version of the 
bootstrap stability plot, we need to consult the print output of the vis object. 


R> vis.d 


name prob logLikelihood 


y~l 1.00 
y~bmi 0.73 
y~bmi+ltg 1.00 
y~bmi+map+ltg 0.69 


-2547.17 

-2454.02 

-2411.20 

-2402.61 
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y~bmi+map+tc+ltg 

y~bmi+map+hdl+ltg 

y~sex+bmi+map+hdl+ltg 

y~sex+bmi+niap+tc+ldl+ltg 


0.42 

-2397.48 

0.32 

-2397.71 

0.67 

-2390.13 

0.47 

-2387.30 


As in the variable inclusion plots, it is clear that the two most important variables are bmi 
and Itg, and the third most important variable is map. In models of size four (including the 
intercept), the model with bmi, Itg and map was selected in 69% of bootstrap resamples. 
There is no clear dominant model in models of size five, with tc and hdl both competing to 
be included. In models of size six, the combination of sex and hdl with the core variables 
bmi, map and Itg, is the most stable option; it is selected in 67% of bootstrap resamples. As 
the size of the model space in dimension six is much larger than the size of the model space 
for dimension four, it could be suggested that the 0.67 empirical probability for the {bmi, 
map, Itg, sex, hdl} model is a stronger result than the 0.69 result for the {bmi, Itg, map} 
model. 

The adaptive fence plots in the bottom row of Figure 6 show a clear peak for the model with 
just bmi and Itg. There are two larger models that also occupy regions of stability, albeit 
with much lower peaks. These are {bmi, map, Itg} and {bmi, map, Itg, sex, hdl} which also 
showed up as dominant models in the model stability plots. Contrasting best. only = TRUE 
in the lower left panel with best. only = FALSE in the lower right panel, we can see that the 
peaks tend to be more clearly distinguished, though the regions of stability remain largely 
unchanged. 

Stepwise approaches using a forward search or backward search with the AIC or BIC all yield 
a model with {bmi, map, Itg, sex, Idl, tc}. This model was selected 47% of the time in 
models of size 7. The agreement between the stepwise methods may be comforting for the 
researcher, but it does not aid a discussion about what other models may be worth exploring. 

An interactive version of the plots in Figure 6 is available at garthtarr.com/apps/mplot. 

To incorporate interaction terms, we suggest selecting the main effects first, then regressing 
the relevant interaction terms on the residuals from the main effects model. This approach 
ensures that the main effects are always taken into account. In this example, we estimate the 
dominant model of dimension six and obtain the fitted residuals. The interaction terms are 
then regressed on the fitted residuals. 


fi> Im.d.main = lm(y ~ sex + bmi + map + hdl + Itg, data = diabetes) 

R> summary(lm.d.main) 

R> db.main = diabetes[, c("sex", "bmi", "map", "hdl", "Itg")] 

R> db.main$y = lm.d.main$residuals 

R> Im.d.int = lm(y ~ - sex - bmi - map - hdl - Itg, data = db.main) 

R> vis.d.int = vis(lm.d.int, B = 200) 

R> af.d.int = af(Im.d.int, B = 200, n.c = 100) 

R> vis.d.int 

name prob logLikelihood 

y~l 1.00 -2390.13 

y~bmi.map+bmi.hdl+map.ltg+hdl.ltg 0.56 -2385.89 


Bootstrapped probability p* Bootstrapped probability 
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Figure 6: Diabetes main effects example. 
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The result can be found in Figure 7. The variable inclusion plots suggest that the most im¬ 
portant interaction terms are hdl.ltg, bmi.hdl, map.ltg and bmi.map. The model stability 
plot suggests that there are no dominant models of size 2, 3 or 4. Furthermore there are no 
models of size 2, 3 or 4 that make large improvements in description loss. There is a dominant 
model of dimension 5 that is selected in 56% of bootstrap resamples. The variables selected 
in the dominant model are {bmi.map, bmi.hdl, map.ltg, hdl.ltg}, which can be found in 
the print output above. Furthermore, this model does make a reasonable improvement in 
description loss, almost in line with the full model. This finding is reinforced in the adaptive 
fence plots where there are only two regions of stability, one for the null model and another for 
the {bmi.map, bmi.hdl, map.ltg, hdl.ltg} model. In this instance, the difference between 
best. only = TRUE and best. only = FALSE is minor. 

Hence, as a hnal model for the diabetes example we suggest including the main effects {bmi, 
map, Itg, sex, hdl} and the interaction effects {bmi.map, bmi.hdl, map.ltg, hdl.ltg}. 
Further investigation can also be useful. For example, we could use cross validation to compare 
the model with interaction effects, the model with just main effects and other simpler models 
that were identified as having peaks in the adaptive fence. Researchers should also incorporate 
their specialist knowledge of the predictors and evaluate whether or not the estimated model 
is sensible from a scientihc perspective. 

7.2. Birth weight example 

The second example is the birthwt dataset from the MASS package which has data on 189 
births at the Baystate Medical Centre, Springfield, Massachusetts during 1986 (Venables and 
Ripley 2002). The main variable of interest is low birth weight, a binary response variable low 
(Hosmer and Lemeshow 1989). We have taken the same approach to modelling the full model 
as in Venables and Ripley (2002, pp. 194-197), where ptl is reduced to a binary indicator of 
past history and ftv is reduced to a factor with three levels. 

R> require(MASS) 

R> hwt <- with(birthwt, { 

+ race <- factor(race, labels = cC'white", "black", "other")) 

+ ptd <- factor(ptl > 0) 

+ ftv <- factor(ftv) 

+ levels(ftv)[-(1:2)] <- "2+" 

+ data.frame(low = factor(low), age, Iwt, race, smoke = (smoke > 0), 

+ ptd, ht = (ht > 0), ui = (ui > 0), ftv) 

+ }) 

R> options(contrasts = cC'contr.treatment", "contr.poly")) 

R> bw.glm <- glmdow ~ ., family = binomial, data = bwt) 

R> round(summary(bw.glm)$coef, 2) 


Estimate Std. Error z value Pr(>|z|) 


(Intercept) 

0. 

CO 

NO 

1, 

.24 

0. 

,66 

0, 
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LO 
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-0. 

.04 

0, 

.04 

-0. 

,96 

0, 

.34 
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-0. 

.02 

0, 

.01 

-2. 

,21 

0. 

.03 

raceblack 
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.19 

0, 

.54 

2. 

,22 

0. 

.03 
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0. 
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0. 
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0, 
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0. 
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The vis and af objects are generated using the htted full model object as an argument to the 
vis() and af () functions. The results are shown in Figure 8, where screenshots have been 
taken of the interactive plots because they display the larger set of variables more clearly than 
the static plot methods. 

R> af.bw = af(bw.glm, B = 150, c.max = 20, n.c = 40) 

R> vis.bw = visCbw.glm, B = 150) 

R> plot(vis.bw, which = "vip") 

R> plot(vis.bw, which = "boot", highlight = "htTRUE") 


R> plot(af.bw) 

R> print(vis.bw, min.prob = 0.15) 

name prob logLikelihood 
low'l 1.00 -117.34 

low~ptdTRUE 0.53 -110.95 

low~age+ptdTRUE 0.15 -108.65 

low~lwt+ptdTRUE+htTRUE 0.16 -105.06 


In this example, it is far less clear which is the best model, or if indeed a “best model” exists. 
All the curves in the variable inclusion plot he above the redundant variable curve, with 
ftv2+ the least important variable. It is possible to infer an ordering of variable importance 
from the variable inclusion plots, but there is no clear cutoff as to which variables should be 
included and which should be excluded. This is also clear in the model stability plots, where 
apart from the bivariate regression with ptd, there are no obviously dominant models. 

In the adaptive fence plot, the only model more complex than a single covariate regression 
model that shows up with some regularity is the model with Iwt, ptd and ht, though at such 
low levels, it is just barely a region of stability. This model also stands out slightly in the 
model stability plot, where it is selected in 16% of bootstrap resamples and has a slightly 
lower description loss than other models of the same dimension. It is worth recalling that the 
bootstrap resamples generated for the adaptive fence are separate from those generated for the 
model stability plots. Indeed the adaptive fence procedure relies on a parametric bootstrap, 
whereas the model stability plots rely on an exponential weighted bootstrap. Thus, to hnd 
some agreement between these methods is reassuring. 

Stepwise approaches using AIC or BIC yield conflicting models, depending on whether the 
search starts with the full model or the null model. As expected, the BIC stepwise approach 
returns smaller models than AIC, selecting the single covariate logistic regression, low 
ptd, in the forward direction and the larger model, low ~ Iwt + ptd + ht when stepping 
backwards from the full model. Forward selection from the null model with the AIC yielded 
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Figure 8: Birth weight example. 
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low ~ ptd + age + ht + Iwt + ui whereas backward selection the slightly larger model, 
low ~ Iwt + race + smoke + ptd + ht + ui. Some of these models appear as features in 
the model stability plots. Most notably the dominant single covariate logistic regression and 
the model with Iwt, ptd and ht identified as a possible region of stability in the adaptive 
fence plot. The larger models identihed by the AIC are reflective of the variable importance 
plot in that they show there may still be important information contained in a number of 
other variables not identified by the BIC approach. 

Calcagno and de Mazancourt (2010) also consider this data set, but they allow for the possi¬ 
bility of interaction terms. Using their approach, they identify “two” best models 


low ~ smoke + ptd + ht + ui + ftv + age + Iwt + ui:smoke + ftv:age 

low ~ smoke + ptd + ht + ui + ftv + age + Iwt + ui:smoke + ui:ht + ftv:age 


As a general rule, we would warn against the . *. approach, where all possible interaction 
terms are considered, as it does not consider whether or not the interaction terms actually 
make practical sense. Calcagno and de Mazancourt (2010) conclude that “Having two best 
models and not one is an extreme case where taking model selection uncertainty into account 
rather than looking for a single best model is certainly recommended!” The issue here is that 
the software did not highlight that these models are identical as the ui: ht interaction variable 
is simply a vector of ones, and as such, is ignored by the GLM fitting routine. 


As computation time can be an issue for GLMs, it is useful to approximate the results using 
weighted least squares (Hosmer et al. 1989). In practice this can be done by fitting the logistic 
regression and extracting the estimated logistic probabilities, vti. A new dependent variable 
is then constructed. 


= log 


TTj 

l-TTi 


Vi - TTj 
7ri(l - TTi) ’ 


along with observation weights Vi = 7rj(l — vrj). For any submodel a this approach produces 
the approximate coefficient estimates of Lawless and Singhal (1978) and enables us to use the 
leaps package to perform the computations for best subsets logistic regression as follows. 


R> pihat = bw.glniSfitted. values 
R> r = hw.glm$residuals 
R> z = logCpihat/(1 - pihat)) + r 
R> V = pihat*(l - pihat) 

R> nbwt = bwt 
R> nbwt$z = z 
R> nbwt$low = NULL 

R> bw.lm = lm(z ~ data = nbwt, weights = v) 

R> bw.lm.vis = vis(bw.lm, B = 150) 

R> bw.lm.af = af(bw.lm, B = 150 c.max = 20, n.c = 40) 

R> plot(bw.lm.vis, which = "vip") 

R> plot(bw.lm.vis, which = "boot", highlight = "htTRUE") 

R> plot(bw.Im.af) 

The coefficients from bw.lm are identical to bw.glm. This approximation provides similar 
results, shown in Figure 9, in a fraction of the time. 




•2 ‘Log-tikelihood Bootstrapped probability 


Garth Tarr, Samuel Mueller, Alan H. Welsh 


23 


Variable inclusion plot 



910 


•i 

: 

► 

9 

• i 

i 

i» 

» 








% 

*i 

^ 1 

* ^ 

i 1 i , 

• I 4 s - 


i» 







1 


in 

• j 


% 

[ ^ 
k A 

1 t j 

Ik 












»• 


1 23456789 10 11 


■ ptdTRUE 

■ htlRUE 

■ smokeTRUE 

■ Iwt 

■ age 

■ nvi 

■ raceother 

■ raceblack 

■ uiTRUE 

■ ftv2. 

■ RV 


■ Without age 

■ With age 


Number of parameters 


Adaptive fence: c*=19.5 













• 

• 

••• 



• « 

•• 

• 

•• 


• 



0 5 10 15 20 


1 

age + Iwt + raceblack + raceother + 

smokeTRUE + ptdTRUE + htTRUE + ui... 

Iwt + ptdTRUE + htTRUE 

Iwt + ptdTRUE + htTRUE + ftvl 

Iwt + raceblack + ptdTRUE + htTRUE 

Iwt + raceblack + ptdTRUE + htTRUE + 

nv 1 

Iwt + raceblack + raceother + ptdTRUE + 
htTRUE 

Iwt + raceblack + raceother + smokeTRUE 

+ ptdTRUE + htTRUE 

Iwt + raceblack + raceother smokeTRUE 

+ ptdTRUE + htTRUE + uiTRUE 

Iwt + raceblack + raceother smokeTRUE 

+ ptdTRUE + htTRUE + uiTRUE + ftvl 

ptdTRUE 

ptdTRUE + htTRUE 


c 


Figure 9: Birth weight example with linear model approximation. 
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8. Conclusion 

In the rejoinder to their least angle regression paper, Efron et al. (2004) comment, 


“In actual practice, or at least in good actual practice, there is a cycle of activity 
between the investigator, the statistician and the computer ... The statistician 
examines the output critically, as did several of our commentators, discussing the 
results with the investigator, who may at this point suggest adding or removing 
explanatory variables, and so on, and so on.” 


We hope the suite of methods available in the mplot package adds valuable information to this 
cycle of activity between researchers and statisticians. In particular, providing statisticians 
and researchers alike with a deeper understanding of the relative importance of different 
models and the variables contained therein. 

In the artificial example, we demonstrated a situation where giving the researcher more infor¬ 
mation in a graphical presentation can lead to choosing the “correct” model when standard 
stepwise procedures would have failed. 

The diabetes data set suggested the existence of a number of different dominant models at 
various model sizes which could then be investigated further, for example, statistically using 
cross validation to determine predictive ability, or in discussion with researchers to see which 
makes the most practical sense. In contrast, there are no clear models suggested for the birth 
weight example. The adaptive fence has no peaks, nor is there a clearly dominant model in 
the model stability plot even though all but one variable are more informative than the added 
redundant variable in the variable inclusion plot. 

While the core of the mplot package is built around exhaustive searches, this becomes com¬ 
putationally infeasible as the number of variables grows. We have implemented similar vi¬ 
sualisations to model stability plots and variable inclusion plots for glmnet which brings the 
concept of model stability to much larger model sizes, though it will no longer be based around 
exhaustive searches. 

The graphs provided by the mplot package are a major contribution. A large amount of 
information is generated by the various methods and the best way to interpret that information 
is through effective visualisations. For example, as was be shown in Section 7.1, the path a 
variable takes through the variable inclusion plot is often more important than the average 
inclusion probability over the range of penalty values considered. It can also be instructive 
to observe when there are no peaks in the adaptive fence plot as this indicates that the 
variability of the log-likelihood is limited and no single model stands apart from the others. 
Such a relatively flat likelihood over various models would also be seen in the model stability 
plot where there was no dominant model over the range of model sizes considered. 

Although interpretation of the model selection plots provided here is something of an “art”, 
this is not something to shy away from. We accept and train young statisticians to interpret 
qq-plots and residual plots. There is a wealth of information in our plots, particularly the 
interactive versions enhanced with the shiny interface, that can better inform a researchers’ 
model selection choice. 
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